sgemm: reuse loaded vector in AVX dot product calculation by GermanAizek · Pull Request #17648 · ggml-org/llama.cpp

GermanAizek · 2025-12-01T11:03:53Z

This change optimizes the AVX-based `sgemm` (single-precision general matrix multiplication) kernel by introducing a local `__m256i` variable, `avec`, to cache the result of `load(A + lda * (ii + i) + l)`. Previously, this memory load was performed four times per iteration, once in each of the `updot` calls for `Cv[0][i]` through `Cv[3][i]`.

By loading the vector once and reusing it, the code eliminates these redundant memory accesses, which can reduce memory latency and improve instruction-level parallelism. This is a common subexpression elimination (CSE) optimization, which matters for performance in the tight loops of vectorized kernels.
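The load-once pattern described above can be sketched in portable C++. This is an illustration under stated assumptions, not the code from the PR: `Vec8` is a hypothetical stand-in for a 256-bit AVX register, `madd` plays the role of the `updot` call, the tile sizes `RM` and `RN` are assumed, and the global load counter exists only to make the difference visible.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one 256-bit AVX register holding 8 floats.
struct Vec8 {
    std::array<float, 8> lane{};
};

constexpr int KN = 8;  // floats per vector
constexpr int RM = 2;  // rows of A in the tile (assumed for this sketch)
constexpr int RN = 4;  // rows of B in the tile, matching Cv[0] through Cv[3]

using Tile = std::array<std::array<float, RM>, RN>;

struct Result {
    Tile c{};               // reduced dot products
    std::size_t loads = 0;  // number of vector loads issued
};

static std::size_t g_loads = 0;  // instrumentation only, not part of the technique

// Load 8 consecutive floats and count the access.
static Vec8 load(const float *p) {
    ++g_loads;
    Vec8 v;
    for (int j = 0; j < KN; ++j) v.lane[j] = p[j];
    return v;
}

// acc += a * b, lane by lane.
static Vec8 madd(const Vec8 &a, const Vec8 &b, Vec8 acc) {
    for (int j = 0; j < KN; ++j) acc.lane[j] += a.lane[j] * b.lane[j];
    return acc;
}

// Horizontally sum each accumulator and record the load count.
static Result reduce(const Vec8 (&Cv)[RN][RM]) {
    Result r;
    for (int j = 0; j < RN; ++j)
        for (int i = 0; i < RM; ++i)
            for (float x : Cv[j][i].lane) r.c[j][i] += x;
    r.loads = g_loads;
    return r;
}

// Before: the same A vector is re-loaded for each of the RN accumulators.
static Result tile_reload(const float *A, int lda, const float *B, int ldb, int k) {
    assert(k % KN == 0);
    g_loads = 0;
    Vec8 Cv[RN][RM] = {};
    for (int l = 0; l < k; l += KN)
        for (int i = 0; i < RM; ++i)
            for (int j = 0; j < RN; ++j)
                Cv[j][i] = madd(load(A + lda * i + l), load(B + ldb * j + l), Cv[j][i]);
    return reduce(Cv);
}

// After: the A vector is loaded once into avec and reused RN times.
static Result tile_reuse(const float *A, int lda, const float *B, int ldb, int k) {
    assert(k % KN == 0);
    g_loads = 0;
    Vec8 Cv[RN][RM] = {};
    for (int l = 0; l < k; l += KN)
        for (int i = 0; i < RM; ++i) {
            const Vec8 avec = load(A + lda * i + l);
            for (int j = 0; j < RN; ++j)
                Cv[j][i] = madd(avec, load(B + ldb * j + l), Cv[j][i]);
        }
    return reduce(Cv);
}
```

Both functions produce the same tile; only the number of source-level loads of `A` differs. The counter does not reflect generated machine code: an optimizing compiler may already merge the repeated loads, so any real speedup would need to be measured.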

References:

* [Common Subexpression Elimination - Wikipedia](https://en.wikipedia.org/wiki/Common_subexpression_elimination)
* [Optimizing with Intel AVX2 - Intel Developer Zone](https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-with-intel-avx2.html)
* [SIMD performance: data alignment and memory access - Daniel Lemire's Blog](https://lemire.me/blog/2012/05/31/simd-performance-data-alignment-and-memory-access/)
* [Loop Optimization in Compiler Design - GeeksforGeeks](https://www.geeksforgeeks.org/loop-optimization-in-compiler-design/)
* [Performance Optimization - CPU Caches and Memory Hierarchy - Princeton University](https://www.cs.princeton.edu/courses/archive/fall09/cos333/lectures/17_perf.pdf)

Co-Authored-By: Gemini 2.5 Pro (References and desc commit changes)

GermanAizek requested a review from ggerganov as a code owner on December 1, 2025 11:03

github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Dec 1, 2025

pwilkin added the `vibe-coded` label (created with heavy use of LLM assistants, requires human verification) on Dec 1, 2025

GermanAizek wants to merge 1 commit into ggml-org:master from GermanAizek:reuse-vector
